Chinese Lipreading Network Based on Vision Transformer
XUE Feng1, HONG Zikun2, LI Shujie1, LI Yu2, XIE Yincen2
1. School of Software, Hefei University of Technology, Hefei 230601; 2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601
Abstract: Lipreading is a multimodal task that converts videos of lip movements into text, with the goal of understanding what a speaker says in the absence of sound. Existing lipreading methods adopt convolutional neural networks to extract visual features of the lips, but these networks capture only short-distance pixel relationships, making it difficult to distinguish the lip shapes of similarly pronounced characters. To capture long-distance relationships between pixels in the lip region of video frames, an end-to-end Chinese sentence-level lipreading model based on the vision transformer (ViT) is proposed. By fusing ViT with the gated recurrent unit (GRU), the model's ability to extract visual spatio-temporal features from lip videos is improved. Firstly, the global spatial features of each lip image are extracted by the self-attention modules of ViT. Then, a GRU models the temporal sequence of frames. Finally, a cascaded attention-based sequence-to-sequence model predicts Chinese pinyin and then Chinese character sequences. Experimental results on the Chinese lipreading dataset CMLR show that the proposed model achieves a lower Chinese character error rate than existing methods.
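The pipeline sketched in the abstract (per-frame ViT spatial encoding, GRU temporal encoding, cascaded pinyin-then-character decoding) can be illustrated by tracing tensor shapes through each stage. All concrete sizes below (frame count, image size, patch size, embedding and hidden dimensions, vocabulary sizes) are illustrative assumptions, not values reported by the paper:

```python
# Hypothetical shape walk-through of the ViT + GRU lipreading pipeline.
# Sizes are assumptions chosen for illustration only.

def vit_gru_pipeline_shapes(num_frames=75, img=64, patch=16,
                            d_model=512, gru_hidden=256,
                            pinyin_vocab=400, char_vocab=3000):
    """Trace tensor shapes: per-frame ViT -> frame features -> GRU ->
    cascaded seq2seq decoding (pinyin, then characters)."""
    # 1. Each lip frame is split into (img/patch)^2 non-overlapping patches;
    #    a learnable [CLS] token is prepended before self-attention, so the
    #    ViT processes n_patches + 1 tokens per frame.
    n_patches = (img // patch) ** 2
    vit_tokens = (num_frames, n_patches + 1, d_model)
    # 2. The [CLS] embedding of each frame serves as its global spatial
    #    feature, yielding one d_model-dimensional vector per frame.
    frame_features = (num_frames, d_model)
    # 3. A GRU consumes the frame sequence and emits one hidden state per
    #    time step, modelling temporal dependencies across frames.
    gru_outputs = (num_frames, gru_hidden)
    # 4. A first attention-based decoder predicts pinyin token logits; a
    #    second, cascaded decoder conditions on them to predict characters.
    pinyin_logits_per_step = (pinyin_vocab,)
    char_logits_per_step = (char_vocab,)
    return vit_tokens, frame_features, gru_outputs, \
        pinyin_logits_per_step, char_logits_per_step


if __name__ == "__main__":
    for name, shape in zip(
            ["ViT tokens", "frame features", "GRU outputs",
             "pinyin logits/step", "char logits/step"],
            vit_gru_pipeline_shapes()):
        print(f"{name}: {shape}")
```

With a 64×64 frame and 16×16 patches, each frame contributes 16 patch tokens plus the [CLS] token, so the ViT stage sees 17 tokens per frame; only the [CLS] output is kept as the frame's global spatial feature before the GRU.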